Skip to content

nssh: harden session recovery#11

Merged
abizer merged 5 commits into
masterfrom
session-recovery
May 12, 2026
Merged

nssh: harden session recovery#11
abizer merged 5 commits into
masterfrom
session-recovery

Conversation

@abizer
Copy link
Copy Markdown
Owner

@abizer abizer commented May 12, 2026

Summary

Hardens the per-host session-recovery mechanism that the session-recovery checkpoint introduced. Two earlier branch commits stayed, plus four new behaviors:

  • Skip prepareRemote when joining. The original process for the host already wrote the remote session file and seeded the log. Redoing it wastes an SSH and produces a misleading second session-open event.
  • Replay missed messages via ?since=<id>. Track the latest ntfy message id we processed and ask for everything since that point on reconnect — solves the laptop-closed-overnight case directly. Closed-up TCP sockets timeout via deadlineConn, then the next subscribe burst-replays anything posted while we were unreachable.
  • subscribe-up / subscribe-down events in the jsonl with reconnect gap, so silent failures are visible from nssh status --tail.
  • ping / pong liveness probe between nssh peers sharing a topic. pingTopic is a one-shot subscribe + publish + 1.5 s wait — distinguishes "PID is alive" from "PID is alive but wedged".
  • Collision prompt when an existing session is found: [J]oin / [R]eplace (SIGTERM, then SIGKILL after 1 s) / [N]ew topic / [C]ancel. Default depends on liveness. Non-interactive shells join silently with a stderr warning if the peer didn't answer. New --join / --replace / --new flags skip the prompt.
  • nssh sweep <host> new subcommand. Lists mosh-server processes for $USER on the remote (oldest first) and offers to kill them. --all for unattended, --older 168h to keep the last week. Safe with tmux-inside-mosh — the tmux server is its own daemon, sessions detach instead of dying.

Wire-protocol additions: ping and pong kinds. Both local ↔ local. Documented in CLAUDE.md.

Follow-ups (auto-sweep on connect, spare-yourself in --all) are captured in #10.

Test plan

  • just test passes (added parseEtime tests)
  • go vet clean, gofmt clean
  • nssh --help shows new flags + sweep
  • nssh sweep arg validation (no host, conflicting flags, bad duration) behaves
  • Manual: two nssh hostA instances → second one prompts; pick join → both share topic
  • Manual: replace path → first instance gets SIGTERM, second takes over with fresh topic
  • Manual: laptop sleep → ntfy reconnect logs replay count + gap
  • Manual: nssh sweep hostA lists mosh-servers; pick by PID; pick old; SIGKILL escalation

🤖 Generated with Claude Code


Note

Medium Risk
Changes core session startup/reuse logic and ntfy subscription behavior, plus adds a remote process-killing command (sweep), so bugs could disrupt session routing or terminate the wrong process if host/PID detection is off.

Overview
Hardens local session reuse and recovery. nssh <host> now detects existing local sessions for the same canonical host and either joins, replaces (SIGTERM→SIGKILL), or starts a new topic, with --join/--replace/--new flags and an interactive prompt; joining also skips redundant prepareRemote work.

Improves ntfy subscriber resilience. The subscriber tracks the last processed ntfy message id and reconnects with ?since=<id> to replay missed messages, emits new subscribe-up/subscribe-down log events (with gap/since metadata), and adds ping/pong wire kinds used for liveness checks during collision resolution.

Adds maintenance tooling and docs updates. Introduces nssh sweep <host> to list and optionally kill remote mosh-server processes (with --all/--older and tests for ps etime parsing), updates log schema fields (host, joined, reconnect metadata), and refreshes protocol/CLAUDE documentation; .gitignore now ignores /.claude/.

Reviewed by Cursor Bugbot for commit 148702d. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c78d07791a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/nssh/main.go Outdated
joinedPID := 0
if cfg.Topic == "" {
cfg.Topic = generateTopic()
existing := findActiveSessionForHost(shortHost)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Key session reuse by full SSH target identity

The collision lookup now uses only shortHost (findActiveSessionForHost(shortHost)), but resolveShortHost normalizes hostnames to the first label (drops the domain) and ignores the SSH user. That causes unrelated targets like alice@db1.prod vs bob@db1.stage (or two different FQDNs sharing db1) to be treated as the same session, so a new run can incorrectly join/replace another connection’s topic and skip remote preparation for the actual target. This can cross-wire clipboard/open traffic and break shims on the second host/user.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Nil dereference when joining without existing session
    • Added explicit nil guards in the join and replace switch arms so the caller no longer relies on resolveSessionCollision's internal nil handling, falling back to a fresh topic if existing is nil.

Create PR

Or push these changes by commenting:

@cursor push 88eb385cd4
Preview (88eb385cd4)
diff --git a/cmd/nssh/main.go b/cmd/nssh/main.go
--- a/cmd/nssh/main.go
+++ b/cmd/nssh/main.go
@@ -666,11 +666,19 @@
 		existing := findActiveSessionForHost(shortHost)
 		switch resolveSessionCollision(existing, collisionFlag) {
 		case "join":
+			if existing == nil {
+				cfg.Topic = generateTopic()
+				break
+			}
 			cfg.Topic = existing.Topic
 			cfg.Server = existing.Server
 			joinedPID = existing.PID
 			fmt.Fprintf(os.Stderr, "nssh: joining active session for %s (PID %d)\n", shortHost, existing.PID)
 		case "replace":
+			if existing == nil {
+				cfg.Topic = generateTopic()
+				break
+			}
 			fmt.Fprintf(os.Stderr, "nssh: replacing existing session for %s (PID %d)\n", shortHost, existing.PID)
 			replaceSession(existing)
 			cfg.Topic = generateTopic()

You can send follow-ups to the cloud agent here.

Comment thread cmd/nssh/main.go Outdated
abizer and others added 3 commits May 11, 2026 19:37
- skip prepareRemote when joining an existing session for the host
  (the original process already wrote the remote session file and
  seeded the log; redoing it costs an SSH and adds a misleading
  duplicate session-open event)
- record `joined: <pid>` on the local session-start event so the log
  distinguishes a fresh connect from a join

subscriber resilience:
- log subscribe-up / subscribe-down events with reconnect gap, so
  silent failures are visible in `nssh status --tail`
- track the most-recent ntfy message id and pass it as ?since=<id>
  on every reconnect — anything posted while we were asleep / on a
  broken connection is replayed instead of dropped on the floor

liveness + collision UX:
- new ping / pong wire kinds and a pingTopic helper (one-shot
  subscribe + publish + wait for ack) to distinguish "PID is alive
  and answering" from "PID is alive but wedged"
- on a session collision, ping the existing peer and prompt
  interactively: [J]oin / [R]eplace (SIGTERM + escalate to SIGKILL)
  / [N]ew topic / [C]ancel. Default is join when alive, replace
  when not. Non-interactive shells silently join with a stderr
  warning if the peer didn't answer.
- new flags --join / --replace / --new to skip the prompt

remote cleanup:
- new `nssh sweep [--all|--older <dur>] <host>` subcommand that
  enumerates mosh-server processes owned by $USER on the remote
  and offers to kill them. Safe with tmux-inside-mosh: tmux server
  is its own daemon, so detached sessions survive.

follow-ups in #10 (auto-sweep on connect, spare-yourself in --all).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@abizer abizer force-pushed the session-recovery branch from c78d077 to ec5dc3a Compare May 12, 2026 02:43
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: New log fields invisible in nssh status --tail
    • Updated formatEvent to render the new host, joined, reconnect, gap, and since log fields in status tail output.

Create PR

Or push these changes by commenting:

@cursor push ce1939e6aa
Preview (ce1939e6aa)
diff --git a/cmd/nssh/status.go b/cmd/nssh/status.go
--- a/cmd/nssh/status.go
+++ b/cmd/nssh/status.go
@@ -324,9 +324,18 @@
 	if e.Exit != nil {
 		fmt.Fprintf(&sb, " exit=%d", *e.Exit)
 	}
+	if e.Gap != "" {
+		fmt.Fprintf(&sb, " gap=%s", e.Gap)
+	}
+	if e.Host != "" {
+		fmt.Fprintf(&sb, " host=%s", e.Host)
+	}
 	if e.ID != "" {
 		fmt.Fprintf(&sb, " id=%s", e.ID)
 	}
+	if e.Joined != 0 {
+		fmt.Fprintf(&sb, " joined=%d", e.Joined)
+	}
 	if e.Kind != "" {
 		fmt.Fprintf(&sb, " kind=%s", e.Kind)
 	}
@@ -339,9 +348,15 @@
 	if e.Persona != "" {
 		fmt.Fprintf(&sb, " persona=%s", e.Persona)
 	}
+	if e.Reconnect {
+		fmt.Fprintf(&sb, " reconnect=%v", e.Reconnect)
+	}
 	if e.Server != "" {
 		fmt.Fprintf(&sb, " server=%s", e.Server)
 	}
+	if e.Since != "" {
+		fmt.Fprintf(&sb, " since=%s", e.Since)
+	}
 	if e.Size > 0 {
 		fmt.Fprintf(&sb, " size=%d", e.Size)
 	}

You can send follow-ups to the cloud agent here.

Comment thread cmd/nssh/log.go
- collapse --join/--replace/--new flag parsing onto one case arm
- flatten the collision switch so the existing==nil branch hoists out
  of all three "case" bodies
- drop the redundant joinedPID guard in session-start (omitempty handles 0)
- replaceSession: extract pidfile path once, single deferred os.Remove
- subscribeNtfy: hoist subscribe-down logging into a markDown closure
  shared by the connect-failure and stream-end paths
- sweep: collapse --older/--older= branches; extract pidsToStrings
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: New log fields invisible in nssh status --tail
    • Added the missing LogEvent fields to status tail formatting so host, joined, reconnect, gap, and since are printed when present.

Create PR

Or push these changes by commenting:

@cursor push 636d9e59b0
Preview (636d9e59b0)
diff --git a/cmd/nssh/status.go b/cmd/nssh/status.go
--- a/cmd/nssh/status.go
+++ b/cmd/nssh/status.go
@@ -324,9 +324,18 @@
 	if e.Exit != nil {
 		fmt.Fprintf(&sb, " exit=%d", *e.Exit)
 	}
+	if e.Gap != "" {
+		fmt.Fprintf(&sb, " gap=%s", e.Gap)
+	}
+	if e.Host != "" {
+		fmt.Fprintf(&sb, " host=%s", e.Host)
+	}
 	if e.ID != "" {
 		fmt.Fprintf(&sb, " id=%s", e.ID)
 	}
+	if e.Joined != 0 {
+		fmt.Fprintf(&sb, " joined=%d", e.Joined)
+	}
 	if e.Kind != "" {
 		fmt.Fprintf(&sb, " kind=%s", e.Kind)
 	}
@@ -339,9 +348,15 @@
 	if e.Persona != "" {
 		fmt.Fprintf(&sb, " persona=%s", e.Persona)
 	}
+	if e.Reconnect {
+		fmt.Fprintf(&sb, " reconnect=%v", e.Reconnect)
+	}
 	if e.Server != "" {
 		fmt.Fprintf(&sb, " server=%s", e.Server)
 	}
+	if e.Since != "" {
+		fmt.Fprintf(&sb, " since=%s", e.Since)
+	}
 	if e.Size > 0 {
 		fmt.Fprintf(&sb, " size=%d", e.Size)
 	}

You can send follow-ups to the cloud agent here.

Comment thread cmd/nssh/status.go
The session-recovery changes added Host / Joined / Reconnect / Gap /
Since to LogEvent but the formatter in status.go was never updated,
so the reconnect-gap and replay-since data that was the whole point
of the visibility work was silently dropped from tail output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Self-SIGTERM via stale pidfile with PID reuse
    • activeSessions now treats a pidfile for the current PID as stale PID reuse and removes it before collision handling can replace it.

Create PR

Or push these changes by commenting:

@cursor push 1326bdb7db
Preview (1326bdb7db)
diff --git a/cmd/nssh/status.go b/cmd/nssh/status.go
--- a/cmd/nssh/status.go
+++ b/cmd/nssh/status.go
@@ -90,6 +90,7 @@
 	if err != nil {
 		return nil
 	}
+	currentPID := os.Getpid()
 	var out []sessionInfo
 	for _, e := range entries {
 		if e.IsDir() || !strings.HasSuffix(e.Name(), ".json") {
@@ -104,6 +105,11 @@
 		if err := json.Unmarshal(data, &info); err != nil {
 			continue
 		}
+		// A pidfile for our own PID is stale PID reuse; don't let replace kill us.
+		if info.PID == currentPID {
+			_ = os.Remove(path)
+			continue
+		}
 		// Signal 0 is a permission/existence probe on POSIX.
 		if err := syscall.Kill(info.PID, 0); err != nil {
 			_ = os.Remove(path)

diff --git a/cmd/nssh/status_test.go b/cmd/nssh/status_test.go
new file mode 100644
--- /dev/null
+++ b/cmd/nssh/status_test.go
@@ -1,0 +1,42 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"os"
+	"path/filepath"
+	"testing"
+	"time"
+)
+
+func TestActiveSessionsDropsStaleSelfPID(t *testing.T) {
+	t.Setenv("XDG_STATE_HOME", t.TempDir())
+
+	dir := sessionsDir()
+	if err := os.MkdirAll(dir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+
+	path := filepath.Join(dir, fmt.Sprintf("%d.json", os.Getpid()))
+	data, err := json.Marshal(sessionInfo{
+		PID:     os.Getpid(),
+		Target:  "devbox",
+		Host:    "devbox",
+		Topic:   "nssh_stale",
+		Server:  defaultServer,
+		Started: time.Now().Add(-time.Hour),
+	})
+	if err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(path, data, 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	if sessions := activeSessions(); len(sessions) != 0 {
+		t.Fatalf("activeSessions() = %d sessions, want 0", len(sessions))
+	}
+	if _, err := os.Stat(path); !os.IsNotExist(err) {
+		t.Fatalf("stale self pidfile still exists: %v", err)
+	}
+}

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 148702d. Configure here.

Comment thread cmd/nssh/status.go
}
}
return nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-SIGTERM via stale pidfile with PID reuse

Medium Severity

findActiveSessionForHost calls activeSessions(), which considers a PID alive if syscall.Kill(pid, 0) succeeds. If a previous nssh crashed without cleanup and the OS reuses that PID for the current process, the stale session passes the liveness check (it's us!). When the host matches and the user picks "replace" (the default for an unresponsive peer), replaceSession sends syscall.Kill(ourOwnPID, SIGTERM), terminating the current process before the signal handler is installed. Filtering out os.Getpid() in findActiveSessionForHost or activeSessions would prevent this.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 148702d. Configure here.

@abizer abizer merged commit bfa3180 into master May 12, 2026
3 checks passed
@abizer abizer deleted the session-recovery branch May 12, 2026 03:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants